Fast estimation of posterior probabilities in 
change-point analysis through a constrained 

^ hidden Markov model 

o 

(N 
^ The Minh Luong* Yves Rozenholc, and Gregory Nuel 

^ MAPS, Universite Paris Descartes, 45 rue des Saints-Peres, 75006 

^ Paris, France 

< July 2012 

-(— > 

CN Abstract 
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■^ The detection of change -points in heterogeneous sequences is a statis- 

^ tical challenge with applications across a wide variety of fields. In bioin- 

■^ formatics, a vast amount of methodology exists to identify an ideal set of 

-y-^ change-points for detecting Copy Number Variation (CNV). While consider- 

O able efficient algorithms are currently available for finding the best segmen- 

^ tation of the data in CNV, relatively few approaches consider the important 

• • problem of assessing the uncertainty of the change-point location. Asymp- 

. !^ totic and stochastic approaches exist but often require additional model as- 

^ sumptions to speed up the computations, while exact methods have quadratic 

3 complexity which usually are intractable for large datasets of tens of thou- 

sands points or more. In this paper, we suggest an exact method for obtaining 
the posterior distribution of change-points with linear complexity, based on 
a constrained hidden Markov model. The methods are implemented in the 
R package postCP, which uses the results of a given change-point detection 
algorithm to estimate the probability that each observation is a change-point. 
We present the results of the package on a publicly available CNV data set 



*e-mail: the-minh.luong@parisdescartes.fr; Corresponding author 



{n = 120). Due to its frequentist framework, postCP obtains less conserva- 
tive confidence intervals than previously published Bayesian methods, but 
with linear complexity instead of quadratic. Simulations showed that postCP 
provided comparable loss to a Bayesian MCMC method when estimating 
posterior means, specifically when assessing larger-scale changes, while be- 
ing more computationally efficient. On another high-resolution CNV data 
set (« = 14, 241), the implementation processed information in less than one 
second on a mid-range laptop computer. 

1 Introduction 

The detection of change-points in heterogeneous sequences is a statistical chal- 
lenge with many applications in fields such as finance, reliability, signal analysis, 
neurosciences and biology ( Pinkel et al4|1998| Snijders et al4 2001 1. In bioinfor- 



matics in particular, a vast amount of methodology ( ,Qlshen et al4|2004||Fridlyand 



et al4 12004[ |Hupe et aLj |2004[ ) exists for identifying an ideal set of change-points 



in data from array Comparative Genomic Hybridization (aCGH) techniques, in 
order to identify Copy Number Variation (CNV). 

A typical expression of the change-point problem is as follows, given data X — 
(Xi,X2,...,X„) of real- valued observations, (5i,...,5„) corresponding segment 
indices of the observations, and ^k as the set of all possible combinations of 5 for 
fixed K ^2 number of segments. The goal is to find the best partitioning S E ^k 
into K non-overlapping intervals, assuming that the distribution is homogeneous 
within each of these intervals. 

For K segments of contiguous observations, the segment-based model ex- 
presses the distribution of X given a segmentation S e ^k as: 

P(X|5; 0) = n^e,, {Xd = flU Se, [Xi) (1) 

i=\ k=li,Si=k 

where go^{-) is the parametric, or emission, distribution (e.g.: normal or Poisson) 
of the observed data with parameter 0^, = (Oi,. . . , Ok) is the global parameter, 
and Si is the segment index at position i. For example, if n = 5, ^ = 2, and with 
change-point located between positions 2 and 3, then S = (1,1,2,2,2). 

Introducing a prior distribution P(5) on any S E Mk obtains a posterior distri- 
bution of the segmentation: 

^^^'^'^^"L^P(X|i?;0)P(i?)- ^^^ 



For a uniform prior, set ^^ = (^Z J = \^k\- 

A common alternative to the above segmentation procedure is to consider an 
unsupervised hidden Markov model (HMM). Assuming that S is a Markov chain 



of hidden states, this approach (Rabiner, 1989) can be thought of as being level- 
based, with the parameter of the k"^ segment takes its value in the set of L ^ 1 
levels J {01 , 02, • • • 5 0l}- This simply is equivalent to the model defined by Equa- 
tion ([l|, with the noticeable difference that S G {1,2, .. . ,L}". With this level- 
based approach ^ ^ L in general, and the HMM is unconstrained in the sense 
that transitions are possible between any pair of states. This is an appropriate 
model when the conditional distribution within a given segment of contiguous ob- 
servations may be shared among other segments. While the unconstrained HMM 
is preferable in many practical situations, the segment-based model requires less 
assumptions and is thus a more general model. 

A convenient feature of these HMM approaches is in computing efficiently 
the posterior distribution F{S\X;0) in 0{L^n) using classical forward-backward 
recursions ( Durbin et al.[ 1998| ), making them suitable for handling large datasets. 
This paper focuses on using an computationally efficient exact procedure to char- 
acterize the uncertainty P(S|X; 0) of the estimated change-point locations using a 
hidden Markov model adapted to the conventional segmentation model as previ- 
ously described. We exploit the effectiveness of the level-based HMM approach 
through a constrained HMM corresponding exactly to the above segment-based 
model, providing a fast algorithm for computing P(5|X; 0). 

We develop this posterior distribution procedure as change-point assessment 
in practical applications becomes more challenging from a computational point 
of view. For example, emerging high-throughput technologies are producing in- 
creasingly large amounts of data for CNV detection. For finding the exact pos- 



terior distribution of cha nge-points P(5|X; 0) , Guedon (2007) suggested an algo- 
rithm in 0{KrP-), while Rigaill et al. (2011) considered the same quantity in a 



Bayesian framework with the same complexity. However, the complexity of these 
approaches provides for very slow processing for large datasets with sequences 
of tens of thousands or more. Other estimates generally focus on asymptotic be- 



havior whose conditions are delicate due to the discreteness of the problem (Bai 



and Perron] |2003^ |Muggeo[ |2003|), on bootstrap techniques (Huskova and Kirch 



2008) and on stochastic methods such as particle filtering (Fearnhead and Clif- 



ford[|2003| ), recursive sampling ( jLai et al.[|2008| ), and Markov chain Monte Carlo 



'similar to the segment-based model, the choice of L is critical and is usually addressed 
through penalized criteria. 



([Erdman and Emerson[ |2008[). Furthermore, many of the faster stochastic algo- 



rithms assume normal error structure to speed up the estimation procedures and 



are thus more difficult to adapt to non-normal data ( |Lai et al.[ |2008[ [Erdman and 
Emerson! [2008] ) ■ 



Section|2]presents a summary of current change -point methods, the constrained 
HMM algorithm and a description of the accompanying R statistical package. Sec- 
tion [3] implements the methods on a published array CGH data set and compares 
with published results. Section |4]presents examples of simple simulated data sets 
with comparisons between methods. Section [5] shows an illustrative example of 
the methods on a larger scale SNP array data set, while Section [6] includes a dis- 
cussion. 



2 Methods 



2.1 Current change-point detection methods 

A considerable amount of literature focuses on detecting the ideal number or set 
of change-point locations. For CNV detection, [Fridlyand et al.[ ( [2004[ ) developed 
a discrete HMM to map the number of states and the most likely state at each 
position. Observations where state transitions are most likely to occur indicate 
change-points. Various extensions to this HMM approach include various proce- 
dures such as merging change-points ( Willenbrock and Fridlyand[ 2005 1 and spec- 
ifying prior transition matrices ( Marioni et al.[ 2006| ) to improve the results, and 



simultaneous identification of CNV across multiple samples (Shah et al. 2009). 
HMM-based implementations for dealing with higher-resolution data for current 
array technologies include those from [Colella et al.[ ( [2007[ ); |Wang et al.[ ( [2007[ ). 

A wide amount of segment-based approaches for identifying CNV in ge- 
nomics data have also been explored. [Olshen et"aL] ( |2004| ) introduced an extension 
of binary segmentation through a non-parametric approach, using permutation 
rather than parameter estimation to test for change-points. This algorithm requires 
data smoothing and pruning to improve computation time and change-point detec- 
tion for very large sequences. To speed up this computationally-intensive process. 



Venkatraman and Olshen (2007) introduced adjusted p- values and stopping rules 



for the permutations. Hupe et al. (2004) introduced a likelihood-based approach 
to estimate parameters in a normal Gaussian model after adaptive weights smooth- 
ing. Hsu et al.|( |2005| ) proposed a wavelet-based change-point detection procedure 



with data smoothing, while Filers and De Menezes (2005) introduced a quantile 



approach to smoothing. [Willenbrock and Fridlyand| ( [2005[ ); |Lai et al.| ( [2005] ) sum 



marized and compared the various segmentation approaches for aCGH data. 
To estimate the number of segments, /T, [Zhang and Siegmund ( 2007| ) extended 



an earlier method including a modified Bayes Information Criterion to adjust for 



the number of change-points using a recursive procedure. Comte and Rozen 



hole (2004) described a least squares minimization procedure with a penalization 



criterion for the number of contrasts, while Picard et al. ( |2005 ) implemented an 



adaptive method for estimating the location and number of change-points, with 
penalization terms for likelihoods. 

2.2 Constrained HMM 

Let us assume that 5 is a heterogeneous Markov chain over {1,2,. ..,K,K+ 1} 
(where ^ + 1 is a "junk" state only considered for consistency reasons) such as 
P(5i = 1) = 1 and with the following transitions: for all 2^ i ^n, and I ^k^K 
we have P(5,- = k\Si-i =k) = l- r\k{i) and P(5/ = fc+ l|5/_i =k)= r]k{i). For 
consistency, choose P(5;- = ^+l|5,_i =^-f 1) = 1. For example, ifn = 5,K — 2, 
and 5 =(1,1, 2,2,2) then P(5) = (1-77i(2))t]2(3)(1-t]2(4))(1- 772(5)). With 
this Markov chain, it is clear that {S E Mk\ = \3n = ^}- 

In the particular case where the Markov chain is homogeneous with T]yt(0 — 
r\ e]0, 1 [ for all k and /, P(5) = (1 - t])"-^?]^-! for all Se^xQln other words, 
only positive state jumps of +1 are possible. Therefore P(5|5 G ^k) = '^/\^k\, 
which corresponds to the canonical choice of a uniform prior on ^k- Note that 
a choice of different transition coefficients T]yt(0 allows us to specify informative 
priors. 

2.3 Forward-backward procedure and posterior probabilities 

The constrained HMM provides a framework for additional inference on the un- 
certainty in the estimated change-point model; in particular, after obtaining the 
segmentation from any previous procedure, we can obtain confidence intervals 
around each of the identified change-points. In terms of practical applications, 
this approach is helpful when dealing with situations where both very short and 
very long segments may be present, and the exact location of change-points may 
not be identifiable. 



^Note that the particular value of Tj does affect ¥{S) but has no effect whatsoever on ¥{S\S € 
^a") of the chosen segmentation S. We can therefore safely make an arbitrary choice hke 7] — 0.5 
for practical computations. 



The forward-backward algorithm (IRabinerl |1989[), also known as posterior 



encoding, is a recursive algorithm that can estimate the posterior probabilities of 
each observation i being in a particular hidden state 5,, and being a change-point 
such that Si ^ 5/_i. The algorithm is of complexity 0{Kn) for K segments and 
n observations; the sparse transition matrix between states reduces the 0{K^n) 
complexity of the classical forward-backward algorithm. A summary of this al- 



gorithm and other inferential procedures involved in HMM estimation is in Cappe 



etal. (2005). 



We define the forward and backward quantities as follows, for observation / 
and state k: 

For 1 ^ ?' ^ n — 1 : 

Fi{k)=F{Xi.,,=xi.,,,Si = k) (3) 

B,{k)=FiX,+ ,,„=X,+ Un:Sn=m=k) (4) 

We obtain the forward quantities by recursion through the following formulae: 
Forward: 

e. 
= [i^_i(;t)(l -77,(0) +l,>if;-_i(;t-l)77,(/)]ge,(-^.) (6) 

where gQ^^ {xi) is the density function of the chosen emission distribution g with 
parameter Q]^, when k is the underlying segment for observation /. 

We use a similar recursive procedure to obtain the backward quantities: 
Backward: 

f ^K{n)gQ^^{xn) iik = K-\ 

B,^y{k) = \ {\-^K{n))gQ^{Xn) \ik = K (7) 

[ else 

5.-1 W = jy{Si = ^\Si-i =KSE ^K)ge,{xi)B,{i) 

e 

= {\-T]k{i))g0,{xi)Bi{k) + lk<Krik+iii)ge,+M)Bi{k+\) (8) 
To obtain the posterior probabilities of the state 5,- = k ai position i, we note 



that: 

n^l:n=Xl:n,SeJ^K) =Fi(l)5i(l) (9) 

The constrained HMM estimates the probability of changing state while being 
at state 5,- = kat observation i as: 

¥{Si = k\Xi:„=xu„,Si-i =k-\,SeJlK) = ^ 7-. TV ■ (11) 

We can sample a vector of length ^ — 1 change-points from the original data set 
through these quantities, and afterwards generate a new vector of observed data 
through the chosen emission distributions. The posterior probability of the k''^ 
change-point occurring after observation i, or in other words / + 1 being the first 
observation in the k+ V^ segment, is: 



¥{CPk = i\Xi.,n=Xl;n,S e JIk) = P(5/ = k,S,+ l =k+l\Xi.,n=Xl:n,S G ^k) 

(12) 



Fi(k)rik{i+\)ge,^,{xk+i)B,+i{k+r 



Fi(l)5i(l 

The accuracy of these posterior probabilities relies on the ability of the preced- 
ing change-point model to provide proper initial estimates of the number of and 
locations of the change-points, as they are crucial in selecting the 6k parameters 
used in the forward-backward procedures. 

Maximum a posteriori estimation of the most probable set of change-points 
is also possible through the Viterbi algorithm ( Viterbi[ 1967| ) by modifying the 



forward quantities, with the Viterbi quantities chosen as Vi{k) for observation / 
and state k. 

Vi{l)=g0^{xi), 

Viil)=V,^iil)il-Jli{i))ge,ixi),i{i>2,k=l 

Viik) =max{V,^i{k-\)r]k{i).V,^iik)il-r]kiimge,ixi), if i,k> 2 

We trace the path of indices k used to calculate Vi^k to get the set of change- 
points with highest posterior probability: 

• K-l"" change-point CPk i : ? of V, (i^ - 1 ) used to calculate V;+ 1 {K) 

• k*^ change-point CPk'. i of y,(A:) used to calculate V,+i(fc+ 1) (where CPk < 

CPk+i). 
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2.4 Statistics package postCP 

We apply the preceding methods in the statistics package postCP, available on the 
CRAN website 



http : //cran . r-pro j ect . org/web/packages/postCP, 



The forward and backward recursive algorithms are programmed in C++ to opti- 
mize the speed of the computationally-intensive process. 

The following is a typical R command line for segmenting a sequence LRR, 
with a vector of length K—l change-points (or last index of segments) initseg, and 
95% confidence intervals. The options also save forward-backward and posterior 
change-point probabilities in the output, obtain the most likely set of change- 
points through the Viterbi algorithm, and generate 100 different vectors (each 
of length ^ — 1) of change -point locations for generating data according to our 
change-point model. 

postCP (data=LRR , seg=initseg , ci=0 . 95 , viterbi=TRUE , nsamples= 100) 

The package also provides documentation for options on specifying a matrix of 
log-densities corresponding to any specified distribution of the data, and the prior 
specification of a matrix of transition probabilities. 



3 Detecting copy number variation in array CGH 
breast cancer data 

We apply the methods to a widely referenced data set from cell line BT474 from a 
breast cancer tumor ( [Snijders et al.[|200T] ). The data consist of log-reference ratios 



signifying the ratio of genomic copies of test samples compared to normal. The 
goal is to segment the data into segments with similar copy numbers, with change- 
points corresponding to copy number aberrations pointing to possible genes of 
interest ( Pinkel et al.[ 1998| ). We compare the results of postCP to the Bayesian 



confidence intervals previously published on the same data by Rigaill et al. (201 1 1, 
consisting of 120 observations from chromosome 10. The observations are sorted 
according to their relative position along chromosome 10. 



We use a modification of the greedy least squares ^-means algorithm (!Har- 



tigan and Wong[ 1979) to obtain an initial segmentation for 3 and 4 segments 



(Table [T]). The locations refer to the indices of the last observations of the first 
^ — 1 segments. Afterwards, we used postCP to obtain estimates of the forward 



and backward matrices in Section 2.3[ and afterwards, estimates of the posterior 
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(a) Three segments 



(b) Four segments 



Figure 1: Plots of estimated posterior change-point probabilities for chromo- 



some 10 data (Snijders et al. 2001). Dots are LRR data with horizontal lines 



representing the estimated means within segments scaled to the probability axes, 
(a) Posterior probabilities for three segments, first change-point initialized after 
/ = 68 is a solid line, 2nd initialized after / = 96 is a dashed line, (b) Posterior 
probabilities for four segments, additional change-point after i = 80 is a dotted- 
dashed line. 



probabilities of the change-points being at each observation. We assumed a ho- 
moscedastic normal model for the observations. 



Figure 1(a) displays the estimated posterior change-point probabilities of the 
aCGH data for three segments and Figure [T(b)] for four segments. For both seg- 
mentations, the probability of the last change-point had a sharp peak close to 
1.0. There is an additional change-point found at position 80 by the four segment 
model with a relatively high uncertainty. In both the 3- and 4-segment instances, 
the shapes of the change-point distributions are irregular due to the discreteness 
of the segmentation procedure. 

At these predefined observations, the corresponding posterior change-point 
probabilities were higher than reported by Rigaill, and confidence intervals were 
slightly narrower. This is expected, as Rigaill's method uses a Bayesian frame- 
work that accounts for the uncertainty in the parameter estimates. In particular, 
the rightmost change-point estimation with both 3 and 4 segments had posterior 
probabilities above 0.95, while the corresponding posterior probabilities from the 



Table 1: Change-point confidence intervals aCGH chromosome 10 
CP # A mean Est location 95 % CI (postCP) 95% CI (Bayesian) 







Three segments 




1 


-0.22 


68 66-76 


64-78 


2 


-0.71 


96 96-96 
Four segments 


92-97 


1 


-0.34 


68 66-76 


66-78 


2 


-0.20 


80 79-85 


78-97 


3 


-0.80 


96 96-96 


91-112 



Change -point estimates of aCGH data for chromosome 10 from Snijders et al 



(2001), with (95% confidence interval), hy postCP and Bayesian confidence 



intervals by Rigaill et al. (201 1 1. Narrower confidence intervals were found by 
postCP. 



Bayesian method were closer to 0.5. 

We also can use postCP to obtain a joint sample of an entire set of change- 
points from the original data using the constrained HMM model. Using postCP, 
we sampled 10, 000 data sets using (11). As seen in Figure[2} the histogram of the 



generated locations for the first two change-points closely approximates the pos- 
terior change-point probabilities found by postCP. Though there is little overlap 
between the estimated confidence intervals for the first two change-points, there 
was still a small but non-zero empirical Pearson correlation (r = 0. 123,/> < 0.001) 
between the simultaneously generated first and second change-point locations. On 
the other hand, there was no association between the third generated change-point 
and the other two change-points, an expected result since their confidence inter- 
vals do not overlap. 

4 Change-point data simulations and comparisons 

In this section we consider different change-point situations to compare the ability 
of the methods in assessing uncertainty. In this section, we use a loss function to 
assess the ability of the change-point methods in estimating the posterior mean of 
each observation {i = 1 , . . . , n) . For normally distributed data, we quantify loss 
through the mean square error (MSE). We define MSE = E(^(0 ~ %)^' where 
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Figure 2: Histogram of locations of the first two change-points, randomly and 
simultaneously generated by postCP for 10,000 sets, for data from |Snijders et alT 
(pool ). The frequencies of the first change-point from the simulations are in white 
boxes, of the second in gray boxes. The estimated change-point probabilities by 
the forward-backward algorithm for the first change-point are in white dots, for the 
second change-point in black. The randomly generated change-points are close to 
those estimated by postCP. 
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6si are the true underlying means for each observation. For non-normal data, we 
use the median absolute error (MAE) where MAE = Y!i=\ I (^ (0 ~ % I ■ 

We obtain 4 different estimates for the posterior mean of each of the n obser- 
vations as follows. The first two posterior mean estimates are from the Bayesian 
MCMC method implemented in the bcp package ( Erdman and Emerson[|2008 ) 



and the circular binary segmentation method from the DNAcopy package ( [Olshen 



et al. 2004; Venkatraman and Olshen 2007 1 in R, both at the default parameters. 



We also use postcp, with initial change-points from the estimated change-point 
set found by DNAcopy, to estimate posterior probabilities of being in state k for 
each observation /, P{Si = k\Xi-,i = xi:„,S G ^k), and segment means %;fc = 
I,... ,K. The posterior mean of observation i is: 

0(0 = £^(5, =fc|Xi;„ =.^i:„,5 G J^K-A)Ok. 
k=\ 

To obtain another initial set of change-points for postCP, we use another 
greedy procedure for quick model selection as follows. First, we use the greedy 
least-squares minimizing algorithm for a fixed K number of segments to obtain 
an initial set of change-points. We then use the Viterbi algorithm to get the most 
probable set of change-points based on this initial set, and obtain the correspond- 
ing maximum likelihood estimates (MLEs) for the model parameters. The esti- 
mated number of segments K is that which maximizes the Bayesian information 



criterion (BIC) ( |Schwarz| [19781 |Picard et all [2005] ), 



BIC{K)=LL-K' login), 

with respect to the number of segments, where LL is the log-likelihood of the 
data, and K' is the number of parameters in the model. Posterior quantities are as 
previously described. 

We simulated a sequence of n = 500 observations with 6 change-points after 
/ = (22,65, 108,219,252,435). For normally distributed data, odd segments had 
mean Oq = 0.0 and even segments had various values of 0i with standard deviation 
1.0. For Poisson data, odd segments had mean Oq = 1.0. We implement these 
procedures on 1000 sets each of normal and Poisson distributed data for these 4 
methods. Results in Table |2]report the MSE for normal data and MAE for Poisson 
data for different parameter values of 6i . 

For normally distributed data and n = 500, bcp had lower MSE than the postcp 
estimates when the intersegmental differences, or differences between the even 
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Table 2: Average error of different change-point methods 







n 


= 500 






n = 10,000 




00 01 


cbs 


cbs 
+postCP 


greedy 
+postCP 


bcp 


cbs 


cbs 
+postCP 


greedy 
+postCP 


bcp 








normal distribution 








0.0 0.25 


0.017 


0.017 


0.017 


0.016 


0.014 


0.013 


0.014 


0.013 


0.50 


0.058 


0.055 


0.054 


0.045 


0.021 


0.018 


0.025 


0.015 


0.75 


0.083 


0.074 


0.080 


0.051 


0.020 


0.016 


0.025 


0.015 


1.00 


0.068 


0.055 


0.074 


0.052 


0.018 


0.014 


0.022 


0.015 


1.25 


0.055 


0.043 


0.053 


0.051 


0.018 


0.014 


0.022 


0.015 


1.50 


0.050 


0.039 


0.043 


0.047 


0.017 


0.013 


0.021 


0.014 


1.75 


0.049 


0.039 


0.040 


0.045 


0.017 


0.013 


0.020 


0.015 


2.00 


0.047 


0.037 


0.037 


0.043 


0.015 


0.012 


0.019 


0.014 


2.25 


0.045 


0.036 


0.036 


0.041 


0.015 


0.012 


0.018 


0.015 


2.50 


0.042 


0.034 


0.034 


0.039 


0.014 


0.011 


0.017 


0.017 








Poisson distribution 








1.0 2.0 


0.225 


0.225 


0.154 


0.188 


0.0438 


0.045 


0.051 


0.062 


3.0 


0.112 


0.112 


0.119 


0.196 


0.0451 


0.045 


0.050 


0.057 


4.0 


0.114 


0.116 


0.122 


0.178 


0.0469 


0.047 


0.050 


0.055 


5.0 


0.122 


0.123 


0.126 


0.175 


0.0477 


0.046 


0.051 


0.057 


6.0 


0.123 


0.123 


0.126 


0.171 


0.0519 


0.051 


0.052 


0.066 


7.0 


0.136 


0.137 


0.138 


0.180 


0.0529 


0.053 


0.055 


0.080 


8.0 


0.137 


0.136 


0.138 


0.175 


0.0497 


0.049 


0.053 


0.091 


9.0 


0.139 


0.140 


0.140 


0.181 


0.0540 


0.054 


0.056 


0.114 


10.0 


0.149 


0.149 


0.148 


0.187 


0.0546 


0.055 


0.058 


0.150 


11.0 


0.154 


0.154 


0.153 


0.194 


0.0548 


0.055 


0.055 


0.175 



Error of posterior means for different change-point uncertainty methods, in terms 
of Mean Square Error (MSE) for normally distributed observations and Median 
Absolute Error (MAE) for Poisson distributed observations. 
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and odd segments, were less than 1.0 standard deviations. This may be due to 
the frequentist nature of postcp using point estimates of change-point location 
to calculate MLEs, which may not have been accurate at lower intersegmental 
differences. However, at larger intersegmental differences, the methods relying on 
postCP actually had lower MSE than bcp. In particular, for larger intersegmental 
differences, the Bayesian method at the default parameters of bcp were overly 
conservative. 

For illustrative purposes. Figures 3(a) and |3(b) compare the true values of 



the underlying means to those found from greedy+postCP and from bcp in one 
specific set of generated data. For true intersegmental differences of A = 0i — 
00,= 1-0, bcp had slightly lower MSE than postCP. Inaccurate locations from 
the greedy algorithm and BIC resulted in a greater amount of error from postCP 
compared to the relatively conservative posterior change-point probabilities from 
bcp. However, at A = 2.0, postCP performed slightly better than bcp, as the 
greedy algorithm and BIC provided the correct number of segments and accurate 
initial change-point estimates to postCP. 

For Poisson distributed data, the postCP based methods had lower MAE than 
bcp, showing the flexibility of the postCP procedure which can adapt to non- 
normal data. 



Figures 4(a) and 4(b) compare the true values of the underlying means for one 
set of Poisson data, for intersegmental differences of 2.0 and 4.0, respectively. It 
is evident that postCP provides more appropriate posterior mean estimates as bcp 
is not adapted for non-normally distributed data, with highly irregularly- shaped 
curves for the posterior means. 

We also repeat the procedure on a larger sequence of 10,000 observations, 
with 39 change-points randomly selected from a uniform distribution within these 
observations, such that no segment is less than 25 observations long. While there 
is less deviation in the posterior means due to larger segment sizes, the problem 
of estimating the number and location of change-points is made more difficult. 
For normal data using postCP after initialization by cbs showed similar patterns 
to those from n = 500, obtaining lower MSE than bcp other than for very small 
intersegmental differences. However, nsmg postCP after the greedy algorithm and 
BIC did not fare as well. Due to the larger amount of change-points, a more effec- 
tive change -point location estimator than greedy least squares is recommended in 
these situations. 

The exact postCP algorithms also provided some computational advantages in 
terms of model selection over the Bayesian MCMC method bcp, which also has 
linear complexity. For n = 500, the entire model selection procedure averaged 

14 








true 

greedy+postCP 

■bcp;_ 


#• 





100 200 300 

position 

(a) 01-00-1.0 



400 



500 



true 

^reedy+postCP 
bep- . 



r ' .'• 



o - »=?i 






100 200 300 400 500 

position 

(b) 01 - 00 = 2.0 

Figure 3: Plot of sample data n = 500 from normal distribution in Table [2j Un- 
derlying means in thin solid black lines, posterior means estimated by greedy+cp 
in dashed blue lines and by bcp in thick dash-dotted black lines, (a) Interseg- 
mental differences of 1.0 SD, MSB for bcp 0.043, for greedy+postcp 0.066. The 
first change-point is not located precisely (at / = 50) by the greedy method, with 
an extra change-point at i = 299 from BIC. bcp provided more conservative esti- 
mates slightly closer to true values in these regions, (b) Intersegmental differences 
of 2.0 SD, MSB for bcp was 0.053, for-^^^reedy+postcp was 0.040. More precise 
greedy+postcp estimates after correct first change -point was entered. 
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Figure 4: Plot of sample data n = 500 from the Poisson distribution used in Ta- 
ble [2} Underlying means in thin solid black lines, posterior means estimated by 
greedy+cp in dashed blue lines and by bcp in thick dash-dotted black lines. In 
both situations, the Bayesian estimates from bcp provided irregularly shaped pos- 
terior mean curves, (a) 9q = I and 0i = 3, median absolute error for bcp was 
0.205, for greedy+postcp was 0.172. (b) 0o = 1 and Oi = 5, MSB for bcp was 
0.180, for greedy+postcp was 0.175. 
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0.07 seconds with the greedy+postcp model selection procedure, 0.09 seconds for 
cbs+postcp and 0.9 seconds for bcp. For n = 10, 000, the respective runtimes were 
1.37 seconds for cbs+postcp together and 15.7 seconds for bcp. 

5 Detecting copy number variation in SNP array col- 
orectal cancer data 

To illustrate the efficiency of the constrained HMM model, we apply the meth- 
ods to a copy number variation profile obtained through current SNP (single nu- 
cleotide polymorphism) microarray technology ( Staaf et aL] 2008). The data con- 



tain 261,563 SNPs across the genome obtained from an Affymetrix chip from 
a colorectal cancer tumor cell line ( De Roock et al.[ 2010| ) with intensities pro- 



viding information regarding mutations, specifically duplications and deletions. 
After normalization, the log of the intensities in the tumor sample were compared 
to normal values, obtained from a reference sample, to obtain the log-reference 
ratios (LRR) across each of 23 chromosomes. 

We use the circular binary segmentation (CBS) procedure implemented in a 
widely used segmentation package, DNAcopy (Olshen et al.[ |2004^ |Venkatraman 



and Olshen] 2007) for the statistics software R. This package obtains an initial es- 



timate of change-points in LRR, for each of the 23 chromosomes in tumor sample 
103. We ran DNAcopy, after smoothing, at the default parameters of a = 0.01. 
Table [3] displays the best segmentation found within for 14, 241 SNPs from chro- 
mosome 10. This chromosome was selected since both small (second) and large 
(tenth) segments were found, with adjacent segment differences ranging from 0.04 
to 0.58 (Figure|5]). 

We ran postCP using the set of change-points identified by DNAcopy as ini- 
tial segmentation to obtain estimates of the posterior probabilities of the change- 
points being at these ten locations. We assumed a homoscedastic normal model 
for the observations. The forward-backward algorithm was practically instan- 
taneous (less than 0.1 seconds) for this sequence of over 14,000 observations 
on a mid-range dual-core 2.5 GHz, 4GB RAM laptop PC. Not surprisingly, the 
most narrow confidence intervals, and most precise change-point estimates, were 
found for larger differences (Table |4]). Of the ten estimated change-points, six 
separated segments whose means differed by greater than one standard deviation; 
these points all were found to have posterior change-point probabilities greater 
than 0.5. Eight of the ten change-points from DNAcopy had the highest posterior 
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Figure 5: Plot of 14,241 log-reference ratios (LRR) for chromosome 10. Hor- 
izontal lines are the estimated means within segments, vertical lines are the ten 
change-points estimated by DNAcopy. 
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Table 3: Segmentation from CBS for chromosome 10 
Segment Start End Size Mean 



1 


1 


211 


211 


0.031 


2 


212 


215 


4 


-0.552 


3 


216 


273 


58 


-0.028 


4 


274 


383 


110 


-0.322 


5 


384 


736 


353 


0.060 


6 


737 


3091 


2355 


-0.021 


7 


3092 


3102 


11 


-0.477 


8 


3103 


8308 


5206 


-0.011 


9 


8309 


8760 


452 


0.064 


10 


8761 


12383 


3623 


-0.011 


11 


12384 


14241 


1858 


0.031 



Best segmentation results from circular binary segmentation algorithm (Olshen 



et al. 2004). Information from ^=11 segments, with index of start and end of 
segment, size of the segment, and the mean within the segment. The pooled 
standard deviation across the segments was a = 0.188. 
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Table 4: 



Change 


CP 


Post 


A 


Size of confidence interval 


point 


Est 


Prob 


Mean 


ci:0.5 0.6 


0.7 


0.8 


0.9 


1 


211 


0.973 


-0.582 


1 1 


1 


1 


1 


2 


215 


0.918 


0.523 


1 1 


1 


1 


1 


3 


273 


0.556 


-0.293 


1 2 


2 


2 


3 


4 


383 


0.580 


0.381 


1 2 


2 


2 


3 


5 


736 


0.028" 


-0.081 


31 37 


46 


53 


61 


6 


3091 


0.860 


-0.456 


1 1 


1 


1 


2 


7 


3102 


0.880 


0.466 


1 1 


1 


1 


2 


8 


8308 


0.050 


0.075 


21 30 


50 


91 


132 


9 


8761 


0.064 


-0.075 


15 21 


48 


63 


78 


10 


12383 


0.006* 


0.042 


233 336 


412 


501 


522 



Information for 10 change-points identified within chromosome 10. The 
posterior probability of a change was at least 0.50 for all change-points with a 
moderate change (> a = 0. 188), while wider confidence intervals were found 
when segment differences were smaller. ": point 721 had slightly higher 
change-point probability (0.031),*: point 1 1943 had slightly higher change-point 
probabihty (0.008). 



change-point probabilities for their positions, with the exception of the fifth and 
tenth change-points whose probabilities were slightly lower than the respective 
maximum of their position. 

Figure |6(a) displays the posterior change-point probabilities for the first five 
change-points. Given the much narrower segment lengths, and greater segment 
differences, the probabilities are much higher for the first four change-points than 
the fifth change -point, whose 90% confidence interval was greater than 60 SNPs 



wide. Figure 6(b) displays the posterior change-point probabilities for the tenth 
change-point, whose 90% confidence interval was greater than 500 SNPs wide. 
Note that the shapes of the change-point distributions are also highly irregular. 



6 Discussion 



A common point of interest in current genomics studies is to find genetic muta- 



tions pointing to phenotypes susceptible to diseases (RedonetaL 2006) such as 
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Figure 6: Plot of estimated posterior change-point probabilities. Dots are LRR 
data with horizontal lines the estimated means within segments scaled to the prob- 
ability axes, (a) For first five change-points: The first four change-point estimates 
(SNPs at position: 211,215,273,383) are precise, however there is a wide un- 
certainty around the fifth change-point (position: 736). (b) For tenth change- 
point: There is a very wide confidence interval for this change-point estimate. 
The change-point estimate from the CBS algorithm (position 12,383) is the fourth 
rightmost peak. 
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cancers or Type II diabetes. The use of change-point analysis for detecting copy 
number variation (CNV) is a critical step in the characterization of DNA, includ- 
ing tumoral genes associated with cancer. A CNV may aid in locating a genetic 
mutation such as a duplication or deletion in a cancerous cell that is a target for 
treatment. However, due to the often large nature of genomics data sequences it 
is often difficult to rehably identify CNVs. The latest technologies for detecting 
CNV, such as single nucleotide polymorphism (SNP) arrays, are able to produce 
sequences of tens or hundreds of thousands of observations. 

This paper describes a procedure that extracts further useful information from 
segmenting large sequences such as those found in CNV detection in SNP array 
data. The unsupervised HMM, which is a level-based approach used in many 
current implementations, includes forward-backward calculations in linear time. 
However they are less flexible than segment-based methods, in part because they 
require prior transition probabilities between states that assume a geometric prior 
on segment lengths, which may be inappropriate in some situations. On the other 
hand, most segment-based methods can allow for a wider variety of specified 
priors, but in many situations are unfeasible due to their quadratic complexity for 
large data sets such as those generated by SNP arrays. 

The described constrained HMM model's complexity of 0{Kn) rather than 
0{KrP-) enables it to handle very large data sets in reasonable time. The procedure 
allows for the estimation of the full joint distribution of change-points conditional 
to the observations, allowing for practical estimation of confidence intervals. In 
high-resolution data sets with change-points to be unlikely detected at the exact 
correct position, the confidence intervals may yield important information. In SNP 
array technology, the change-points need to be detected with high-precision and 
the differences between segments may not be very large. Additionally, overlap- 
ping confidence intervals across several different cell tumors or lines may identify 
associated copy number variations, and help in identifying similar disease pheno- 
types and treatments in patient subgroups. 

The emission distribution gQ {x) of the observed data can also be easily spec- 
ified in this constrained HMM procedure, unlike other implementations specifi- 
cally developed for normally distributed large-scale data. The ability of postCP to 
easily adapt to a wide variety of distributions through the specification of the emis- 
sion distribution is especially important with next-generation sequencing plat- 
forms often producing data that closely follows the Poisson or Negative Binomial 
distributions ( |Shen et alT||2011| ). 



The described methods are a useful tool in the segmentation of large-scale 
sequences such as those involving CNVs, specifically when combined with any 
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current implementation designed for detecting an ideal set of change-points. For 
example, a fast dynamic programming algorithm of order 0{n\ogn) developed by 



Comte and Rozenholc (2004) obtaining an initial segmentation of the points is to 
be bundled in the postCP package. While the constrained HMM approach is sen- 
sitive to its starting point and tends to provide a local rather than global optimum 
if the initial starting points are misspecified, simulations in this paper showed that 
in practice, postCP provided estimates of posterior means that were comparable 
to, or sometimes better than, a currently implemented Bayesian MCMC method. 
This occurred in situations in which the Bayesian method was specifically de- 
signed, that is for assessing multiple change-points in normally distributed data. 

A planned extension of our method is to combine the segment-based and level- 
based approaches by limiting possible values of segment parameters (0i, . . . , Ok) 
to L < K different values. Further practical applications include more complex 
models to simultaneously model multiple datasets, or accounting for values from 
multiple patients and samples as random effects in a mixed model. Another prac- 
tical extension is to handle multi-dimensional output, such as current technology 
for copy numbers ( |Staaf et aL 2008 1 that simultaneously includes LRR and base- 



line allelic frequency (BAF) data. 
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